METHOD AND APPARATUS FOR CANCELING DATA PREFETCH REQUESTS FOR A LOOP
Patent Abstract:
Methods and apparatus for canceling pending data prefetch requests for a loop. Efficient techniques are described for reducing cache pollution through the use of prefetch logic that recognizes exits from software loops or function returns in order to cancel any pending prefetch request operations. The prefetch logic includes a loop data address monitor to determine a data access stride based on repeated execution of a memory access instruction in a program loop. Data prefetch logic then speculatively issues prefetch requests according to the data access stride. A stop prefetch circuit is used to cancel pending prefetch requests in response to an identified loop exit. The prefetch logic can also recognize a return from a called function and cancel any pending prefetch request operations associated with the called function. When prefetch requests are canceled, demand requests, such as those based on load instructions, are not canceled. This approach to reducing cache pollution uses program flow information to throttle data cache prefetching.
Publication number: BR112015017103B1
Application number: R112015017103-6
Filing date: 2014-01-18
Publication date: 2022-01-11
Inventor: Matthew M. Gilbert
Applicant: Qualcomm Incorporated
Patent Description:
Field of the Invention
[0001] The present invention relates generally to aspects of processing systems and, in particular, to methods and apparatus for reducing cache pollution caused by data prefetching.
Background
[0002] Many portable products, such as cell phones, laptop computers, personal digital assistants (PDAs) and the like, use a processing system that runs programs such as communication and multimedia programs. A processing system for these products may include multiple processors, complex memory systems including multilevel caches for storing instructions and data, controllers, peripheral devices such as communication interfaces, and fixed-function logic blocks configured, for example, on a single chip. At the same time, portable products have a limited power source in the form of batteries, which are often required to support high-performance operation by the processing system. To increase battery life, it is desirable to perform these operations as efficiently as possible. Many personal computers are also being developed with efficient designs to operate with reduced overall power consumption.
[0003] To provide high performance in program execution, data prefetching is based on the concept of spatial locality of memory references and is generally used to improve processor performance. By prefetching multiple data elements into a cache at addresses that are close to a fetched data element, or that are related by a stride-based address delta or an indirect pointer, and that are likely to be used in future accesses, cache miss rates can be reduced. Cache designs often implement a basic form of prefetch by fetching a full data cache line for an individual data element fetch. Hardware prefetchers can expand on this by speculatively prefetching one or more additional data cache lines, where the prefetch address may be formed based on sequential, stride, or pointer information.
Such hardware prefetcher operation on memory-intensive workloads, such as processing a large array of data, can significantly reduce memory latency. However, data prefetching also has its drawbacks. For example, in a software loop used to process an array of data, a data prefetcher circuit prefetches data to be used in future iterations of the loop, including for the final iteration of the loop. However, the data prefetched for the final iteration of the loop will not be used, and cache pollution occurs when this unused data is stored in the cache memory. The problem of cache pollution is compounded when loops are unrolled.
SUMMARY
[0004] Among its various aspects, the present disclosure recognizes that providing more efficient methods and apparatus for prefetching can improve performance and reduce power requirements in a processor system. To such ends, an embodiment of the invention addresses a method for canceling prefetch requests. A loop exit condition is identified based on an evaluation of program flow information. Pending cache prefetch requests are canceled in response to the identified loop exit condition.
[0005] Another embodiment addresses a method for canceling prefetch requests. Data is speculatively prefetched in accordance with a called function. Pending data prefetch requests are canceled in response to a function return from the called function.
[0006] Another embodiment addresses an apparatus for canceling prefetch requests. A loop data address monitor is configured to determine a data access stride based on repeated execution of a memory access instruction in a program loop. Data prefetch logic is configured to speculatively issue prefetch requests according to the data access stride. A prefetch stop circuit is configured to cancel pending prefetch requests in response to an identified loop exit.
[0007] Another embodiment addresses a non-transitory computer-readable medium encoded with computer-readable program data and code.
A loop exit condition is identified based on an evaluation of program flow information. Pending cache prefetch requests are canceled in response to the identified loop exit condition.
[0008] Another embodiment addresses an apparatus for canceling prefetch requests. Means are used to determine a data access stride based on repeated execution of a memory access instruction in a program loop. Means are used to speculatively issue prefetch requests according to the data access stride. Means are also used to cancel pending prefetch requests in response to an identified loop exit.
[0009] It is understood that other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, in which various embodiments of the invention are shown and described by way of illustration. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Various aspects of the present invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which:
[0011] Figure 1 illustrates an exemplary processor system in which an embodiment of the invention may be advantageously employed;
[0012] Figure 2A illustrates a process for canceling pending non-demand data prefetch requests upon detecting an end-of-loop branch;
[0013] Figure 2B illustrates a process for canceling pending non-demand data prefetch requests upon detecting a function return; and
[0014] Figure 3 illustrates a particular embodiment of a portable device having a processor complex that is configured to cancel selected pending data prefetch requests to reduce cache pollution.
DETAILED DESCRIPTION
[0015] The following detailed description, in connection with the accompanying drawings, is intended as a description of various exemplary embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form in order to avoid obscuring the concepts of the present invention.
[0016] Figure 1 illustrates an exemplary processor system 100 in which an embodiment of the invention is advantageously employed. Processor system 100 includes a processor 110, a cache system 112, a system memory 114, and an input and output (I/O) system 116. The cache system 112, for example, comprises a level 1 instruction cache (Icache) 124, a memory controller 126, and a level 1 data cache (Dcache) 128.
The cache system 112 may also include a unified level 2 cache (not shown) or other cache components as desired for a particular deployment environment. System memory 114 provides access to instructions and data not found in the Icache 124 or Dcache 128. Note that the cache system 112 may be integrated with processor 110 and may also include multiple levels of caches in a hierarchical arrangement. The I/O system 116 comprises a plurality of I/O devices, such as I/O devices 140 and 142, that interface with the processor 110.
[0017] Embodiments of the invention may suitably be employed in a processor having conditional branch instructions. Processor 110 includes, for example, an instruction pipeline 120, data prefetch logic 121, prediction logic 122, and loop stack logic 123. Instruction pipeline 120 consists of a series of stages, such as a fetch and prefetch stage 130, a decode stage 131, an instruction issue stage 132, an operand fetch stage 133, an execution stage 134, such as for executing load (Ld) and store (St) instructions, and a completion stage 135. Those skilled in the art will recognize that each stage 130-135 in instruction pipeline 120 may comprise a number of additional pipeline stages, depending on the processor's operating frequency and the complexity of the operations required in each stage. For example, the execution stage 134 may include one or more pipeline stages corresponding to one or more instruction execution stage circuits, such as an adder, a multiplier, logic operations, load and store operations, shift and rotate operations, and other function circuits of greater or lesser complexity. For example, when a load instruction is executed, it requests data from the Dcache 128, and if the requested data is not present in the Dcache, a fetch request is issued to the next level of cache or to system memory. This fetch request is considered a demand request, since it occurs in direct response to the execution of an instruction, in this case a load instruction.
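As a rough software illustration of the demand request described above, the following Python sketch models a level 1 data cache in which a load that misses issues a fill request tagged as a demand request, while the prefetcher issues fills tagged as non-demand. This is a minimal model, not the patented hardware; the class names, the 64-byte line size, and the queue layout are all assumptions for illustration.

```python
from collections import namedtuple

# Each fill request carries a flag distinguishing demand fills from
# speculative prefetch fills, as the text describes.
FillRequest = namedtuple("FillRequest", ["line_addr", "is_demand"])

LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)

class L1Dcache:
    def __init__(self):
        self.lines = set()     # resident cache line addresses
        self.fill_queue = []   # pending fill requests

    def load(self, addr):
        """Demand access from the execute stage; a miss issues a demand fill."""
        line = addr // LINE_SIZE
        if line in self.lines:
            return "hit"
        self.fill_queue.append(FillRequest(line, is_demand=True))
        return "miss"

    def prefetch(self, addr):
        """Speculative fill issued by the prefetcher; tagged as non-demand."""
        line = addr // LINE_SIZE
        if line not in self.lines:
            self.fill_queue.append(FillRequest(line, is_demand=False))

dcache = L1Dcache()
assert dcache.load(0x1000) == "miss"   # demand request issued on the miss
dcache.prefetch(0x1040)                # speculative prefetch request issued
kinds = [r.is_demand for r in dcache.fill_queue]
assert kinds == [True, False]          # the flag distinguishes the two
```

The flag on each queued request is what later allows cancellation logic to drop only the speculative entries.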
[0018] A prefetch request is a request made in response to program flow information, such as detection of a program loop that has one or more load instructions in the loop with, for example, stride-based load addresses. Data prefetch logic 121 uses such program flow information, which can be based on a number of iterations of the detected loop, to more accurately identify a demand usage pattern of the operand addresses of the load instructions before issuing a prefetch request. Prefetch requests are issued when a pattern is detected. Processor 110 may operate to differentiate a demand request from a prefetch request by using an extra flag associated with the request that is monitored in the processor pipeline. This flag may also propagate with the request to the cache, where each cache line fill can be identified as either a prefetch fill or a demand fill. Each of the pipeline stages may have varying implementations without departing from the prefetch request cancellation methods and apparatus described herein.
[0019] In order to minimize delays that could occur if the data required by a program is not in the associated level 1 Dcache 128, the fetch and prefetch stage 130 records program flow information associated with one or more memory access instructions that execute in a detected program loop. The program flow information may include an indication from decode stage 131 that a load instruction has been received, and operand address information for the load instruction may be available in a pipeline stage prior to execution, such as in operand fetch stage 133 or in execution stage 134. Data prefetch logic 121 monitors load addresses as soon as they are available in order to detect a pattern. After a pattern is determined with an acceptable level of confidence, such as by monitoring load instructions through three or more iterations of a loop, a prefetch request for the expected data is issued before the load instruction is encountered again in the loop.
This speculative prefetch request helps ensure the necessary data is available in the level 1 Dcache when needed by the execution stage 134. The load and store execution stage 134 is then more likely to access the necessary data directly from the level 1 Dcache without having to wait to access the data from higher levels in the memory hierarchy.
[0020] The data prefetch logic 121 may also include a data cache loop data address monitor to determine a data access stride. The data prefetch logic 121 then speculatively issues prefetch requests with operand addresses formed in accordance with the data access stride. For example, data prefetch logic 121 may include a stride circuit 119 that is configured to monitor repeated executions of a load instruction to determine the difference between the operand addresses of successive executions of the load instruction, which represents a stride value. Stride circuit 119 may also include an addition function that is configured to add the determined stride value to the operand address of the most recently executed load instruction to generate the next operand address. In contrast to using the stride value to form a predicted address, a fetched conditional branch instruction uses branch prediction logic, such as that contained in prediction logic circuit 122, to predict whether the conditional branch will be taken and to predict the branch target address. A fetched non-branch instruction proceeds to the decode stage 131 to be decoded, is issued for execution in the instruction issue stage 132, executed in the execution stage 134, and retired in the completion stage 135.
[0021] The prediction logic circuit 122 comprises a detection logic circuit 146 for monitoring events, a filter 150, and a conditional history table 152. In one embodiment, it is assumed that most conditional branch instructions usually have their conditions resolve to the same value for most iterations of a software loop.
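The stride determination and next-address generation just described can be sketched in software as follows. This is an illustrative model only: the three-iteration confidence threshold is borrowed from the text's "three or more iterations" example, while the class name and method API are hypothetical.

```python
class StrideDetector:
    """Models a stride circuit: observes successive operand addresses of one
    load instruction and, once a stable stride repeats enough times, adds the
    stride to the latest address to form the next prefetch address."""

    CONFIDENCE = 3  # repeats required before prefetching (per the text)

    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.repeats = 0

    def observe(self, addr):
        """Record one execution of the monitored load; return a speculative
        prefetch address once confident, otherwise None."""
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if delta == self.stride:
                self.repeats += 1
            else:
                self.stride, self.repeats = delta, 1  # new candidate stride
        self.last_addr = addr
        if self.repeats >= self.CONFIDENCE:
            return addr + self.stride  # predicted next operand address
        return None

d = StrideDetector()
addrs = [0x100, 0x108, 0x110, 0x118, 0x120]  # 8-byte stride through an array
preds = [d.observe(a) for a in addrs]
# No prediction until the stride has been seen three times in a row.
assert preds == [None, None, None, 0x120, 0x128]
```

Note how the final predictions run one stride ahead of demand, which is exactly why the last loop iteration leaves a useless prefetch behind.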
[0022] Detection logic circuit 146, in one embodiment, acts as a software loop detector that operates based on the dynamic characteristics of conditional branch instructions used in software loops, as described with respect to Figure 2A. Detection logic circuit 146 may also detect returns from called software functions, as described in connection with Figure 2B.
[0023] In single-entry, single-exit software loops, an end-of-loop branch is usually a conditional branch instruction that branches back to the beginning of the software loop for all iterations of the loop except the last iteration, which exits the software loop. Detection logic circuit 146 may have various embodiments for detecting software loops, as described in more detail below and in Patent Application 11/066,508, assigned to the assignee of the present patent application, entitled "Suppressing Update of the Branch History Register by Loop-Ending Branches", which is incorporated by reference in its entirety.
[0024] According to one embodiment, detection logic circuit 146 identifies a conditional branch instruction with a branch target address lower than the conditional branch instruction's own address, which is thus considered a backward branch and is assumed to mark the end of a software loop. Since not all backward branches are end-of-loop branches, there is some level of imprecision that may need to be accounted for by additional control mechanisms, for example.
[0025] In addition, as described with respect to Figure 2B, a function return instruction (commonly called RET) can be detected. According to one embodiment, detection of a return from a called function is adapted to trigger cancellation of all non-demand prefetch requests. Canceling a prefetch request is thus also done in response to program flow information, much as for detection of a loop exit.
[0026] In another embodiment, an end-of-loop branch can be detected in simple loops by recognizing repeated execution of the same branch instruction.
By storing the program counter value of the last backward branch instruction in a special-purpose register, and comparing this stored value with the instruction address of the next backward branch instruction, an end-of-loop branch can be recognized when the two instruction addresses match. Since code can include conditional branch instructions within a software loop, determining the end-of-loop branch instruction can become more complicated. In such a situation, multiple special-purpose registers can be instantiated in hardware to store the instruction addresses of each conditional branch instruction. By comparing against all stored values, a match can be determined for the end-of-loop branch. Typically, loop branches are conditional direct backward branches with a fixed offset from the program counter (PC). These types of branches do not require address comparisons to detect a loop exit. Instead, once a program loop is detected based on a conditional backward branch, the loop end is determined from the resolution of the branch's predicate. For example, if the predicate resolves to a true condition to return to the top of the loop, then the loop exit would be indicated when the predicate resolves to a false condition. For there to be a pending prefetch at all, the program loop would already have executed a few times to train the prefetch hardware, since the data prefetch logic 121 requires some warm-up demand loads to recognize a pattern before starting to prefetch.
[0027] Also, an end-of-loop branch can be statically marked by a compiler or assembler. For example, in one embodiment, a compiler generates a specific type of branch instruction, either by using a unique opcode or by defining a specially formatted bit field, that is used only for end-of-loop branches. The end-of-loop branch can then be easily detected during pipeline execution, such as during the decode stage of the pipeline.
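The PC-comparison and predicate-resolution schemes above can be combined in a small sketch: a backward taken branch records a candidate loop branch in a single special-purpose register, a repeat of the same branch PC confirms the loop, and a not-taken resolution of that branch then signals the loop exit. This assumes one tracked loop branch; the class and method names are hypothetical.

```python
class LoopMonitor:
    """Detects a loop via repeated backward branches and reports its exit."""

    def __init__(self):
        self.loop_branch_pc = None  # models the special-purpose register
        self.in_loop = False

    def on_branch(self, pc, target, taken):
        """Feed each resolved conditional branch. Returns 'loop_exit' when a
        confirmed end-of-loop branch resolves not taken, else None."""
        backward = target < pc
        if backward and taken:
            if pc == self.loop_branch_pc:
                self.in_loop = True        # same branch again: confirmed loop
            else:
                self.loop_branch_pc = pc   # remember candidate loop branch
        elif self.in_loop and pc == self.loop_branch_pc and not taken:
            self.in_loop = False
            return "loop_exit"             # predicate false: leaving the loop
        return None

m = LoopMonitor()
events = [m.on_branch(0x40, 0x10, True),   # candidate backward branch
          m.on_branch(0x40, 0x10, True),   # confirmed: loop detected
          m.on_branch(0x40, 0x10, True),
          m.on_branch(0x40, 0x10, False)]  # falls through: loop exit
assert events == [None, None, None, "loop_exit"]
```

The warm-up behavior noted in the text falls out naturally here: the loop must iterate before it is even confirmed, so any pending prefetches only exist once the monitor is armed to cancel them.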
[0028] The prediction logic circuit 122 comprises a filter 150, a conditional history table (CHT) 152, and associated monitoring logic. In one embodiment, a monitoring process saves state information from pre-specified condition events that occurred in one or more previous executions of a software loop having a conditional branch instruction that is eligible for prediction. In support of prediction logic circuit 122, filter 150 determines whether a fetched conditional branch instruction has been received and whether the CHT 152 is enabled. An entry in the CHT 152 is selected to provide prediction information that is controlled, for example, by the pipeline stages 132-135 as instructions move through the pipeline.
[0029] A CHT 152 entry records the execution history for the fetched instruction eligible for predicted execution. For example, each CHT entry may suitably comprise a combination of the count values of execution status counters and status bits, which are inputs to the prediction logic. The CHT 152 may also comprise index logic to allow a fetched conditional branch instruction to index into an entry in the CHT 152 associated with the fetched instruction, as multiple conditional branch instructions may exist in a software loop. For example, by counting the number of conditional branch instructions from the top of a software loop, the count can be used as an index into the CHT 152. The prediction logic circuit 122 includes loop counters for counting iterations of software loops, which ensure that execution status counters have the opportunity to saturate at a specified count value that represents, for example, a strongly not-executed status. If an execution status counter saturates, the prediction logic is enabled to make a prediction for the branch direction of the associated fetched conditional branch instruction in the next iteration of the loop.
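One CHT entry's saturating execution status counter can be illustrated with a short sketch. The two-bit encodings ("11" strongly executed down to "00" strongly not executed) and the rule that prediction is enabled only once the counter saturates follow the description in this document; the class name and initial value are assumptions.

```python
class ExecStatusCounter:
    """Two-bit saturating counter for one CHT entry: increments on an
    executed outcome, decrements on a not-executed outcome."""

    def __init__(self, value=0b01):
        self.value = value  # start weakly not executed (assumed)

    def update(self, executed):
        if executed:
            self.value = min(self.value + 1, 0b11)  # toward strongly executed
        else:
            self.value = max(self.value - 1, 0b00)  # toward strongly not exec.

    def prediction_enabled(self):
        """Predict only from a saturated, high-confidence state."""
        return self.value in (0b00, 0b11)

c = ExecStatusCounter()
assert not c.prediction_enabled()   # weak state: no prediction yet
c.update(True); c.update(True)      # two executed outcomes: saturate at 11
assert c.value == 0b11 and c.prediction_enabled()
c.update(False)                     # one contrary outcome only weakens to 10
assert c.value == 0b10 and not c.prediction_enabled()
```

Requiring saturation before predicting is what gives the loop counters described above their role: the loop must iterate enough times for the counter to reach a confident state.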
[0030] The prediction logic circuit 122 generates prediction information that is tracked through the instruction issue stage 132, the operand fetch stage 133, the execution stage 134, and the completion stage 135 in an issue tracking register (TrI) 162, an operand fetch tracking register 163, an execution tracking register (TrE) 164, and a completion tracking register (TrC) 165, respectively. When a backward conditional branch with a false predicate indicating the end of the loop, or a function return, is detected in the processor pipeline, such as during the execution stage 134, a cancel pending prefetch requests signal 155 is generated. In another embodiment, pending prefetch requests are canceled based on a conditional branch prediction generated by branch prediction logic. Each conditional branch is usually predicted by branch prediction logic as taken or not taken. For example, when the prediction information indicates that the conditional branch is taken, which in this example continues a program loop, the speculative instruction fetcher fetches instructions along the program loop path indicated by the prediction. The prediction information is also coupled to a cancel pending prefetch request logic circuit 141, which may reside in the fetch and prefetch stage 130. The cancel pending prefetch request logic circuit 141 can then speculatively cancel pending prefetch requests based on program flow information indicating that the pending prefetch requests are not needed. For example, the processor can be configured not to cancel pending prefetch requests based on a weakly predicted loop exit. By canceling one or more pending data prefetch requests, data cache pollution is reduced and the power used to handle such pollution is reduced in processor 110.
The cancel pending prefetch request signal 155 is coupled to the pipeline 120, as shown in Figure 1, and is accepted by the cancel pending prefetch request logic circuit 141, which causes prefetch requests that are pending, other than demand requests, to be canceled. Processor performance is also improved by not storing unnecessary data in the data cache, since such data might otherwise evict data that would have been used, generating a miss instead.
[0031] Upon reaching execution stage 134, if the execution condition specified by the end-of-loop conditional branch instruction evaluates opposite to its prediction, any speculative execution in the instruction pipeline along the wrong instruction path is corrected, for example by flushing the pipeline, and such a correction may include canceling pending prefetches that are associated with the wrong instruction path. For example, in one embodiment, a pipeline correction includes flushing instructions in the pipeline beginning at the stage where the prediction was made. In an alternative embodiment, the pipeline is flushed beginning at the fetch stage in which the end-of-loop conditional branch instruction was initially fetched. In addition, the appropriate CHT entry may also be corrected after an incorrect prediction.
[0032] Detection circuit 146, acting as a loop detector, operates to detect an end-of-loop branch. For example, an end-of-loop branch is usually a conditional branch instruction that branches back to the beginning of the loop for all iterations of the loop except the last iteration, which exits the loop. Information regarding each identified loop is passed to filter circuit 150, and upon a loop exit condition, the cancel pending prefetch request logic circuit 141 cancels pending non-demand prefetch requests in response to each identified loop exit.
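The selective cancellation just described, dropping speculative prefetches while preserving demand requests, reduces to a simple filter over the pending request queue. The sketch below assumes each pending request carries the demand/prefetch flag described in paragraph [0018]; the tuple layout and function name are hypothetical.

```python
def cancel_pending_prefetches(pending):
    """On a loop exit or function return, keep only demand requests.
    `pending` is a list of (line_addr, is_demand) tuples."""
    return [req for req in pending if req[1]]

pending = [(0x2000, True),    # demand fill from an executed load instruction
           (0x2040, False),   # speculative prefetch for a future iteration
           (0x2080, False)]   # speculative prefetch past the last iteration
after = cancel_pending_prefetches(pending)
assert after == [(0x2000, True)]   # only the demand request survives
```

Dropping the two speculative entries is precisely the cache-pollution saving the disclosure targets: the lines they would have filled are never used after the loop exits.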
[0033] In one embodiment, filter circuit 150, for example, is a loop counter that provides an indication that a defined number of iterations of a software loop has occurred, such as three iterations of a particular loop. For each iteration of the loop, the filter determines whether a conditional branch instruction is eligible for prediction. If an eligible conditional branch (CB) instruction is in the loop, the execution status of the CB instruction is written to the conditional history table (CHT) circuit 152. For example, an execution status counter can be used to record an execution history of previous attempted executions of an eligible CB instruction. An execution status counter is updated in one direction to indicate that a CB instruction conditionally executed and in the opposite direction to indicate that the CB instruction did not conditionally execute. For example, a two-bit execution status counter can be used, where a not-executed status causes the counter to decrement and an executed status causes the counter to increment. The output of the execution status counter is, for example, assigned a value of "11" to indicate that previous CB instructions are strongly indicated as having been executed, an output of "10" to indicate that previous CB instructions are weakly indicated as having been executed, an output of "01" to indicate that previous CB instructions are weakly indicated as having not been executed, and an output of "00" to indicate that previous CB instructions are strongly indicated as having not been executed. The "11" and "00" outputs are saturated output values. An execution status counter would be associated with, or provide the status of, each CB instruction in a detected software loop. However, a particular implementation may limit the number of execution status counters that are used in the implementation and thus limit the number of CB instructions that can be predicted.
Detection circuit 146 generally resets the execution status counters on the first entry into a software loop.
[0034] Alternatively, a disable-prediction flag can be associated with each CB instruction to be predicted, instead of an execution status counter. The disable-prediction flag is set to disable prediction if an associated CB instruction has previously been determined to have executed. The identification of a previous CB instruction that executed implies that the confidence level for predicting a not-executed outcome for that CB instruction would be less than an acceptable level.
[0035] An index counter can also be used with the CHT 152 to determine which CB instruction is being counted or evaluated in the software loop. For example, in a loop having five or more CB instructions, the first CB instruction could have an index of "000" and the fourth eligible conditional branch instruction could have an index of "011". The index represents an address for the CHT 152 to access the stored execution status counter values for the corresponding CB instruction.
[0036] The prediction circuit 122 receives the prediction information for a particular CB instruction, such as the execution status counter output values, and predicts, during the decode stage 131 of Figure 1, for example, that the CB instruction branches back to the start of the software loop and does not predict that a loop exit condition has been reached. In one embodiment, the prediction circuit 122 can predict that the condition specified by the CB instruction evaluates to a not-taken state, such that the code exits or falls through the loop. The prediction circuit 122 tracks the CB instruction. If a CB instruction is predicted to branch to the beginning of the loop, the prediction information indicates that status. If a CB instruction has been determined not to branch, then a tracking circuit generates a cancel pending prefetch request signal and a condition evaluation is done to determine whether an incorrect prediction was made.
If an incorrect prediction has been made, the pipeline may also be flushed, the appropriate CHT 152 execution status counters are updated, and, in one embodiment, the associated CHT entry is flagged to indicate that this particular CB instruction should not be predicted from this point on. In another embodiment, the prediction logic circuit 122 may also change the pre-specified evaluation criterion upon determining that the CB instruction was mispredicted, for example to make the prediction criterion more conservative from this point on.
[0037] It is further recognized that not all loops have similar characteristics. If a particular loop provides poor prediction results, that loop is marked in prediction logic circuit 122 to disable prediction. Similarly, a particular loop may operate predictably under one set of operating scenarios and unpredictably under a different set of operating scenarios. In such a case, recognition of the operating scenarios allows prediction to be enabled, disabled, or enabled with different evaluation criteria suitable for the operating scenario.
[0038] Figure 2A illustrates a process 200 for canceling pending non-demand data prefetch requests upon detecting an end-of-loop branch. At block 202, the processor's code execution is monitored for a software loop. At decision block 204, a determination is made whether a software loop has been detected. A software loop can be determined, for example, by identifying a backward branch to a location that represents the start of the software loop on a first pass through the software loop, as described above. If no software loop has been identified, process 200 returns to block 202. If a software loop has been identified, process 200 proceeds to block 206. At this point in the code, a first iteration of the software loop has already been executed and the next iteration of the software loop is ready to start.
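Marking a poorly behaved loop so that prediction is disabled for it, as paragraph [0037] describes, could be sketched as a small per-loop misprediction table. The two-misprediction tolerance and the table keyed by the loop's branch PC are assumptions for illustration, not values from the text.

```python
class LoopPredictionFilter:
    """Tracks prediction quality per loop and disables prediction for loops
    that repeatedly mispredict."""

    MAX_MISPREDICTS = 2  # assumed tolerance before disabling prediction

    def __init__(self):
        self.mispredicts = {}  # loop branch PC -> misprediction count

    def record_mispredict(self, loop_pc):
        self.mispredicts[loop_pc] = self.mispredicts.get(loop_pc, 0) + 1

    def prediction_allowed(self, loop_pc):
        return self.mispredicts.get(loop_pc, 0) <= self.MAX_MISPREDICTS

f = LoopPredictionFilter()
assert f.prediction_allowed(0x40)
for _ in range(3):
    f.record_mispredict(0x40)       # three bad predictions for this loop
assert not f.prediction_allowed(0x40)
assert f.prediction_allowed(0x80)   # other loops remain eligible
```

A real implementation might instead vary the evaluation criteria per operating scenario, as the text notes, rather than disabling prediction outright.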
[0039] On the next iteration of the software loop, at block 206, the processor code is monitored for a CB instruction. At decision block 208, a determination is made whether a CB instruction was detected, for example during a pipeline decode stage, such as decode stage 131 of Figure 1. If no CB instruction was detected, process 200 returns to block 206. If a CB instruction has been detected, process 200 proceeds to decision block 210. At decision block 210, a determination is made whether the conditional branch (CB) instruction has resolved to terminate the loop, based on an evaluation of the conditional predicate, for example. There are a number of types of CB instruction evaluations that may have been detected. For example, a first evaluation of the detected CB instruction can resolve that the CB instruction is at the end of the software loop but evaluates to continue loop processing. The backward branch CB instruction that identified the software loop on the first pass through the software loop is marked by its address location in the processor code, for example. Thus, for the case where a specified number of iterations of the software loop has not completed, the CB instruction resolves to loop the processor back to the beginning of the software loop. A second evaluation of the detected CB instruction could resolve that the CB instruction is at the end of the software loop and evaluates to end the software loop. A third evaluation of the detected CB instruction can resolve that the CB instruction is inside the software loop, but, whether evaluated as taken or not taken, the processor code remains in the software loop. Finally, a fourth evaluation of the CB instruction could resolve that the CB instruction is inside the software loop, but, when evaluated as taken or not taken, the processor code exits the software loop.
In the fourth evaluation, a CB instruction that is inside the software loop but resolves to a forward branch past the backward branch CB instruction's address location is considered to have exited the software loop.
[0040] Returning to decision block 210, if the detected CB instruction does not resolve to exit the software loop, as in the first and third evaluations of the CB instruction, process 200 proceeds to block 212. At block 212, process 200 continues with normal branch processing and then returns to block 206. If the detected CB instruction has resolved to exit the software loop, as in the second and fourth evaluations of the CB instruction, process 200 proceeds to block 214. At block 214, process 200 cancels pending data prefetch requests, except for demand data fetch requests, processes the CB instruction, and returns to block 202 to begin searching for the next software loop.
[0041] Figure 2B illustrates a process 250 for canceling pending non-demand data prefetch requests upon detecting a function return. At block 252, the processor's code execution is monitored for a software function return. Note that the software function may be speculatively executed; for example, speculative execution can occur for a function called in a software loop. In the case of speculative execution of the software function, the function return, such as execution of a RET instruction, may also be executed speculatively. At decision block 254, a determination is made whether a software function return has been detected, such as by detecting a return instruction in a processor execution pipeline. If no software function return has been detected, process 250 returns to block 252.
[0042] If a software function return has been detected, process 250 proceeds to decision block 256. At decision block 256, a determination is made as to whether the detected return is a return from an interrupt routine.
If the detected exit is a return from an interrupt routine, then process 250 returns to block 252. If the detected exit is not a return from an interrupt routine, process 250 proceeds to block 258. At block 258, process 250 cancels outstanding data prefetch requests, except for demand data prefetch requests, processes the return instruction, and then returns to block 252 to continue monitoring the processor code for a software function exit. [0043] Often, either manually or through compiler optimizations, a software loop is unrolled so that multiple iterations of the loop are executed sequentially. Each sequentially executed unrolled iteration becomes an additional prefetch candidate. On the last iteration of the loop, each unrolled candidate can generate unnecessary prefetch requests, compounding the problem of prefetch data cache pollution. An embodiment of the invention therefore also applies to loop unrolling, detecting the end of the loop or the return from a function and canceling all unnecessary prefetch requests from each unrolled loop iteration. [0044] Figure 3 illustrates a particular embodiment of a handheld device 300 that has a processor complex configured to cancel selected outstanding data prefetch requests to reduce cache pollution. Device 300 may be a wireless electronic device and includes processor complex 310 coupled to system memory 312 holding software instructions 318. System memory 312 may include system memory 114 of Figure 1. Processor complex 310 may include a processor 311, an integrated memory subsystem 314 having a level 1 data cache (L1 Dcache) 322, a level 1 instruction cache (L1 Icache) 326, a cache controller circuit 328, and prefetch logic 316. Processor 311 may include processor 110 of Figure 1. The integrated memory subsystem 314 may also include a unified level 2 cache (not shown). L1 Icache 326 may include L1 Icache 124 of Fig. 1 and L1 Dcache 322 may include L1 Dcache 128 of Fig. 1.
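Processes 200 and 250 above share the same cancellation action: on an identified loop exit (block 214) or on a non-interrupt function return (block 258), outstanding non-demand prefetch requests are cancelled while demand requests survive. The following is a minimal software model of that shared step; the function and parameter names are illustrative assumptions, since the patent describes hardware logic rather than software.

```python
# Hypothetical model of the cancellation step shared by blocks 214 and 258:
# drop outstanding non-demand prefetch requests, keep demand requests.

def cancel_non_demand(prefetch_queue):
    """prefetch_queue: list of (address, is_demand) pairs."""
    return [(addr, demand) for (addr, demand) in prefetch_queue if demand]

def on_branch_resolved(queue, exits_loop):
    # Decision block 210: only a CB instruction that resolves to exit the
    # loop (the second and fourth evaluations) triggers cancellation.
    return cancel_non_demand(queue) if exits_loop else list(queue)

def on_return_detected(queue, is_interrupt_return):
    # Decision block 256: a return from an interrupt routine does not cancel.
    return list(queue) if is_interrupt_return else cancel_non_demand(queue)

queue = [(0x1000, False), (0x1040, True), (0x1080, False)]
assert on_branch_resolved(queue, exits_loop=False) == queue
assert on_branch_resolved(queue, exits_loop=True) == [(0x1040, True)]
assert on_return_detected(queue, is_interrupt_return=True) == queue
assert on_return_detected(queue, is_interrupt_return=False) == [(0x1040, True)]
```

In hardware, the same effect is obtained by the stop prefetch circuit invalidating pending non-demand entries in the prefetch request queue.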
[0045] Integrated memory subsystem 314 may be included in processor complex 310 or may be implemented as one or more separate devices or circuits (not shown) external to processor complex 310. In an illustrative example, processor complex 310 operates in accordance with any of the embodiments illustrated in or associated with Figures 1 and 2. For example, as shown in Figure 3, L1 Icache 326, L1 Dcache 322, and cache controller circuit 328 are accessible within processor complex 310, and processor 311 is configured to access data or program instructions stored in the memories of the integrated memory subsystem 314 or in system memory 312. [0046] A camera interface 334 is coupled to processor complex 310 and also coupled to a camera, such as a video camera 336. A display controller 340 is coupled to processor complex 310 and to a display device 342. An encoder/decoder (CODEC) 344 can also be coupled to processor complex 310. A speaker 346 and a microphone 348 can be coupled to the CODEC 344. A wireless interface 350 can be coupled to processor complex 310 and to a wireless antenna 352 such that wireless data received via antenna 352 and wireless interface 350 can be provided to processor 311. [0047] Processor 311 may be configured to execute software instructions 318 stored on a non-transient computer-readable medium, such as system memory 312, that are executable to cause a computer, such as processor 311, to execute a program, such as process 200 of Figure 2. The software instructions 318 are additionally executable to cause processor 311 to process instructions that access the memories of the integrated memory subsystem 314 and system memory 312. [0048] In a particular embodiment, processor complex 310, display controller 340, system memory 312, CODEC 344, wireless interface 350, and camera interface 334 are included in a system-in-package or system-on-chip device 304.
In a particular embodiment, an input device 356 and a power supply 358 are coupled to system-on-chip device 304. Furthermore, in a particular embodiment, as illustrated in Figure 3, the display device 342, input device 356, speaker 346, microphone 348, wireless antenna 352, video camera 336, and power supply 358 are external to system-on-chip device 304. However, each of the display device 342, the input device 356, the speaker 346, the microphone 348, the wireless antenna 352, the video camera 336, and the power supply 358 can be coupled to a component of system-on-chip device 304, such as an interface or a controller. [0049] Device 300 according to embodiments described herein may be incorporated into a variety of electronic devices, such as a set-top box, an entertainment unit, a navigation device, a communication device, a personal digital assistant (PDA), a fixed location data unit, a mobile location data unit, a mobile phone, a cell phone, a computer, a laptop, a tablet, a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, any other device that stores and retrieves data or computer instructions, or any combination thereof. [0050] The various illustrative logic blocks, modules, circuits, elements, or components described in connection with the embodiments disclosed herein may be implemented or executed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic components, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing components, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors together with a DSP core, or any other configuration suitable for a desired application. [0051] The methods described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of non-transient storage medium known in the art. A non-transient storage medium may be coupled to the processor such that the processor can read information from, and write information to, the non-transient storage medium. In the alternative, the non-transient storage medium may be integral to the processor. [0052] The processor 110 of Figure 1 or the processor 311 of Figure 3, for example, may be configured to execute instructions, including conditional branch instructions, under the control of a program stored on a non-transient computer-readable storage medium either directly associated locally with the processor, such as may be available through an instruction cache, or accessible through an I/O device, such as one of the I/O devices 140 or 142 of Figure 1, for example. The I/O device may also access data residing in a memory device either directly associated locally with the processor, such as Dcache 128, or accessible from another processor's memory.
Non-transient computer-readable storage media may include random access memory (RAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disc (CD), digital video disc (DVD), other types of removable discs, or any other suitable non-transient storage media. [0053] While the invention is presented in the context of illustrative embodiments for use in processor systems, it will be recognized that a wide variety of implementations may be employed by persons of ordinary skill in the art, consistent with the discussion above and the claims that follow below. For example, a fixed-function implementation may also utilize various embodiments of the present invention.
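To summarize the overall technique described above, stride-based speculative prefetch issue throttled by program flow information, the following sketch models one pass through a software loop. It is a hypothetical software model under assumed names; the embodiments describe hardware prefetch logic, and the two-deep prefetch depth is an arbitrary illustration.

```python
# End-to-end sketch: issue stride-based prefetch requests while a loop runs,
# then cancel the outstanding non-demand requests when the loop exits.
# All names and the prefetch depth are illustrative assumptions.

def run_loop_prefetch(addresses, prefetch_depth=2):
    pending = []                        # outstanding (address, is_demand) requests
    last, stride = None, None
    for addr in addresses:
        pending.append((addr, True))    # demand request from the load itself
        if last is not None:
            stride = addr - last        # data access stride from repeated execution
        last = addr
        if stride is not None:
            # speculatively issue prefetch requests along the stride
            for k in range(1, prefetch_depth + 1):
                pending.append((last + k * stride, False))
    # loop exit identified: cancel non-demand requests, keep demand requests
    return [req for req in pending if req[1]]

kept = run_loop_prefetch([0x1000, 0x1040, 0x1080])
assert all(demand for _, demand in kept)            # only demand requests survive
assert [addr for addr, _ in kept] == [0x1000, 0x1040, 0x1080]
```

Without the final cancellation step, the speculative entries issued past the last iteration would fill the data cache with lines the exited loop will never use, which is precisely the pollution the embodiments avoid.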
Claims (13)
[0001] 1. Method (200) for canceling non-demand data cache prefetch requests in a processor system (100) comprising a processor (110) having a cache system (112) comprising a data cache (124), and having an instruction pipeline (120), the method comprising: determining a data access stride based on repeated execution of a memory access instruction in a program loop; speculatively issuing data cache prefetch requests according to the data access stride; and identifying (210) a loop exit based on an evaluation of program flow information; the method characterized in that it comprises: canceling (214) data cache prefetch requests that are outstanding non-demand data cache prefetch requests in response to the identified loop exit.
[0002] 2. Method according to claim 1, characterized in that the loop exit is based on identification of an end-of-loop branch that evaluates to exit the program loop.
[0003] 3. Method according to claim 1, characterized in that the loop exit is based on an incorrect branch prediction which caused speculative instruction fetch and execution to be cancelled.
[0004] 4. Method according to claim 1, characterized in that identifying the loop exit comprises detecting that a conditional branch instruction has resolved to terminate the program loop.
[0005] 5. Method according to claim 1, characterized in that it additionally comprises: detecting that a conditional branch instruction has not resolved to terminate the program loop; and monitoring (202) for the loop exit.
[0006] 6. Apparatus (110) for canceling non-demand data cache prefetch requests in a processor system (100) comprising a processor (110) having a cache system (112) comprising a data cache (124), and having an instruction pipeline (120), the apparatus comprising: a loop data address monitor configured to determine a data access stride based on repeated execution of a memory access instruction in a program loop; data prefetch logic (121) configured to speculatively issue data cache prefetch requests in accordance with the data access stride; and means for identifying (210) a loop exit based on an evaluation of program flow information; the apparatus characterized in that it comprises: a stop prefetch circuit configured to cancel data cache prefetch requests that are outstanding non-demand data cache prefetch requests in response to the identified loop exit.
[0007] 7. Apparatus according to claim 6, characterized in that the loop data address monitor comprises: a stride circuit (119) configured to monitor the repeated execution of the memory access instruction to determine a difference in an operand address for each execution of the memory access instruction, wherein the difference in the operand address is a stride address value; and a summing function circuit configured to add the stride address value to the operand address of the most recently executed memory access instruction to determine the next operand address.
[0008] 8. Apparatus according to claim 6, characterized in that the identified loop exit is based on identification of an end-of-loop branch that evaluates to exit the program loop.
[0009] 9. Apparatus according to claim 6, characterized in that the identified loop exit is based on an incorrect branch prediction that cancels speculative instruction fetch and execution.
[0010] 10. Apparatus according to claim 6, characterized in that the identified loop exit is based on detecting that a conditional branch instruction has resolved to terminate the program loop.
[0011] 11. Apparatus according to claim 6, characterized in that the stop prefetch circuit is additionally configured to detect that a conditional branch instruction has not resolved to terminate the program loop, wherein the program loop continues until the loop exit is identified.
[0012] 12. Apparatus according to claim 6, characterized in that the stop prefetch circuit is additionally configured not to cancel pending prefetch requests based on a weakly predicted loop exit.
[0013] 13. Computer-readable memory characterized in that it comprises instructions stored therein, the instructions being computer-executable to carry out the steps of the method as defined in any one of claims 1 to 5.
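The loop data address monitor of claim 7 can be sketched in software as follows. This is a minimal, hypothetical model in which the class and method names are illustrative assumptions; the claim itself describes a hardware stride circuit and summing function circuit.

```python
# Minimal software sketch of the loop data address monitor of claim 7.
# Names are illustrative; the claim describes hardware circuits.

class LoopDataAddressMonitor:
    def __init__(self):
        self.last_addr = None   # operand address of the most recent execution
        self.stride = None      # difference between successive operand addresses

    def observe(self, addr):
        """Record one execution of the monitored memory access instruction."""
        if self.last_addr is not None:
            # Stride circuit: the stride address value is the difference in
            # the operand address between consecutive executions.
            self.stride = addr - self.last_addr
        self.last_addr = addr

    def next_operand_addr(self):
        """Summing function circuit: last operand address plus the stride."""
        if self.stride is None:
            return None         # fewer than two executions observed so far
        return self.last_addr + self.stride

mon = LoopDataAddressMonitor()
for addr in (0x2000, 0x2040, 0x2080):   # loads striding by 0x40 bytes
    mon.observe(addr)
assert mon.next_operand_addr() == 0x20C0
```

The address returned by `next_operand_addr` is what the data prefetch logic of claim 6 would use to speculatively issue the next prefetch request, and it is these speculative requests that the stop prefetch circuit cancels on a loop exit.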
Patent family:
Publication number | Publication date
EP2946286A1 | 2015-11-25
WO2014113741A1 | 2014-07-24
JP2016507836A | 2016-03-10
ES2655852T3 | 2018-02-21
KR101788683B1 | 2017-10-20
JP6143886B2 | 2017-06-07
TW201443645A | 2014-11-16
CN105074655A | 2015-11-18
BR112015017103A2 | 2017-07-11
US9519586B2 | 2016-12-13
HUE035210T2 | 2018-05-02
TWI521347B | 2016-02-11
CN105074655B | 2018-04-06
US20140208039A1 | 2014-07-24
KR20150110588A | 2015-10-02
EP2946286B1 | 2017-10-25
Legal status:
2018-11-13 | B06F | Objections, documents and/or translations needed after an examination request [chapter 6.6 patent gazette]
2020-02-27 | B06U | Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]
2021-10-26 | B09A | Decision: intention to grant [chapter 9.1 patent gazette]
2022-01-11 | B16A | Patent or certificate of addition of invention granted [chapter 16.1 patent gazette] | Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 18/01/2014, SUBJECT TO THE LEGAL CONDITIONS.
Priority:
Application number | Filing date | Patent title
US13/746,000 | 2013-01-21
US13/746,000 | US9519586B2 | 2013-01-21 | 2016-12-13 | Methods and apparatus to reduce cache pollution caused by data prefetching
PCT/US2014/012152 | WO2014113741A1 | 2014-01-18 | 2014-07-24 | Methods and apparatus for cancelling data prefetch requests for a loop
Sulfonates, polymers, resist compositions and patterning process
Washing machine
Washing machine
Device for fixture finishing and tension adjusting of membrane
Structure for Equipping Band in a Plane Cathode Ray Tube
Process for preparation of 7 alpha-carboxyl 9, 11-epoxy steroids and intermediates useful therein an